Data Visualization before Analysis

Visualizing data during exploration, before any formal analysis, guards against conclusions that summary statistics alone cannot support. It lets the data speak for itself instead of having an interpretation imposed on it in advance. The best-known illustration is Anscombe’s quartet, which consists of four small x-y datasets with nearly identical descriptive statistics: mean of x = 9, variance of x = 11, mean of y = 7.50, variance of y of about 4.12, correlation between x and y = 0.816, and a linear regression line of y = 3.00 + 0.50x.
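The statistics quoted above can be checked directly. The following is a minimal Python sketch that uses only the standard library. The data values are the ones commonly reproduced from Anscombe's 1973 paper and are an assumption here, since the table in Figure 10.27 is not reproduced in the text; the names QUARTET and summarize are illustrative.

```python
from statistics import mean, variance

# Anscombe's quartet as commonly reproduced (assumed to match Figure 10.27).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return descriptive statistics and the least-squares line for y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)          # sum of squared deviations in x
    syy = sum((b - my) ** 2 for b in y)          # sum of squared deviations in y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),                    # sample variance (divides by n - 1)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,        # Pearson correlation coefficient
        "intercept": my - slope * mx,
        "slope": slope,
    }


for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumptions, all four datasets produce summaries that agree with the quoted figures to roughly two decimal places, which is exactly why the numbers alone cannot tell the datasets apart.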

One might conclude from these statistics that the datasets are very similar. Scatterplots of the four datasets, however, show how different they are. The regression line fits one dataset well, a second dataset’s points follow a clear curve, and the remaining two are each dominated by a single outlier. Even so, the fitted linear regression line is nearly the same in every case. Figure 10.27 shows the data, and Figure 10.28 shows the plots of the data.

Figure 10.27: Anscombe’s Quartet Data
Figure 10.28: Anscombe’s Quartet Charts

Judged by the statistics alone, the four datasets appear nearly identical. The charts, however, show that they are very different. Statistics summarize data numerically, while charts reveal patterns and relationships that the summary measures do not capture.
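Even without a charting tool, a coarse character-grid scatterplot is enough to see the differences. This is only a sketch: text_scatter and its parameters are hypothetical names, the axis ranges are chosen by hand to cover the quartet, and the data values are assumed to match the commonly reproduced ones in Figure 10.27. A real analysis would normally use a plotting library instead.

```python
# Anscombe's quartet as commonly reproduced (assumed to match Figure 10.27).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def text_scatter(x, y, width=40, height=12, x_range=(2, 20), y_range=(2, 14)):
    """Render (x, y) points as '*' on a character grid and return it as a string.

    Points are assumed to lie inside x_range and y_range; points that fall in
    the same cell are drawn once.
    """
    grid = [[" "] * width for _ in range(height)]
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    for xv, yv in zip(x, y):
        col = round((xv - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((yv - y_lo) / (y_hi - y_lo) * (height - 1))
        grid[height - 1 - row][col] = "*"        # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(cells) for cells in grid)
    return body + "\n+" + "-" * width            # last line is the x-axis


for name, (x, y) in QUARTET.items():
    print(f"Dataset {name}")
    print(text_scatter(x, y))
    print()
```

All four panels share the same axis ranges, so the shapes can be compared directly: a roughly linear cloud, a curve, a line with one stray point, and a vertical stack with one distant point.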

Anscombe’s quartet shows why a dataset should be visualized before it is analyzed and conclusions are drawn from it.

Visualizing a Single Variable

The most common way to visualize a single variable is a histogram (or density plot). A histogram shows the range of the data and where within that range the values are concentrated. Things to look for are dirty data (out-of-range values, outliers) and the shape of the distribution (unimodal or multimodal), which gives an idea of how many distinct populations might be in the dataset.
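The binning behind a histogram can be sketched in a few lines of Python. The function name histogram, the equal-width binning rule, and the sample values below are all illustrative assumptions, not data from the text; the sample is constructed to contain two groups and one suspicious value.

```python
def histogram(values, bins=10):
    """Count values in equal-width bins spanning min(values) to max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical measurements: two clusters plus one value far from the rest.
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 12, 13, 13, 14, 14, 14, 15, 15, 16, 48]

edges, counts = histogram(sample, bins=10)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

In the printed bars, the two separated peaks suggest two populations, and the lone count in the last bin flags a value worth checking as a possible outlier or data-entry error.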

Visualizing Multiple Variables

The scatterplot is the most common way to visualize the relationships among multiple variables. A scatterplot usually displays two variables, but it can represent up to five by using the x-axis, y-axis, marker size, color, and shape. Ask whether the points fall along a straight line or a curve, which indicates a strong relationship, or form a diffuse cloud, which indicates a weak one. Bar plots can show multiple variables side by side in different colors. Box plots compare the distributions of a numeric variable grouped by a categorical variable.
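A box plot is drawn from a five-number summary (minimum, first quartile, median, third quartile, maximum) computed separately for each category. The sketch below computes those numbers with Python's standard library; the function name five_number_by_group, the choice of the "inclusive" quartile method, and the sample records are assumptions made for illustration. Charting tools may use a different quartile rule and usually mark outliers separately.

```python
from collections import defaultdict
from statistics import quantiles


def five_number_by_group(records):
    """Group (category, value) pairs and summarize each group's distribution.

    Returns {category: (min, q1, median, q3, max)}. Each group is assumed to
    hold at least two values, which quantiles() needs on older Python versions.
    """
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)

    summary = {}
    for category, values in groups.items():
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summary[category] = (min(values), q1, median, q3, max(values))
    return summary


# Hypothetical records: a numeric variable observed in two categories.
records = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("A", 5),
           ("B", 10), ("B", 20), ("B", 30), ("B", 40), ("B", 50)]

for category, numbers in five_number_by_group(records).items():
    print(category, numbers)
```

Placing the summaries side by side is the numeric equivalent of reading adjacent boxes in a box plot: differences in the medians and in the spread between quartiles show how the groups differ.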